Overall, the goal here was it to explore the data set and reduce it to run the subsequent ATAC-Seq Quality Control steps (ATAC-Seq-QC) locally.
These are the packages required for the code below.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(binom)
library(lubridate)
library(broom)
## Warning: Paket 'broom' wurde unter R Version 4.4.3 erstellt
library(readxl) #for loading excel files
## Warning: Paket 'readxl' wurde unter R Version 4.4.3 erstellt
library(ggsci)
library(Seurat) #for RNA-seq and ATAC-seq data
## Lade nötiges Paket: SeuratObject
## Warning: Paket 'SeuratObject' wurde unter R Version 4.4.3 erstellt
## Lade nötiges Paket: sp
##
## Attache Paket: 'SeuratObject'
##
## Die folgenden Objekte sind maskiert von 'package:base':
##
## intersect, t
All data was sourced from: Lin X, Yang X, Chen C, et al. Single-nucleus chromatin landscapes during zebrafish early embryogenesis. Sci Data. 2023;10(1):464. doi:10.1038/s41597-023-02373-y . It was kindly compiled and pre-processed by Lauren Saunders.
First, the data was loaded into R using the code below.
readRDS("Data/Lin_2023_zfish_snATAC-seq/Cell-type_Peak_Matrix.rds") -> zfish_snATAC_seq_pk_mtrx #count matrix
read_csv("Data/Lin_2023_zfish_snATAC-seq/atac_all.metaData.csv")-> zfish_mta_data # meta data
## New names:
## Rows: 50637 Columns: 24
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): ...1, Sample, stage, Clusters, celltype, predictedCell, predictedG... dbl
## (17): TSSEnrichment, ReadsInTSS, ReadsInPromoter, ReadsInBlacklist, Prom...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
read_csv("Data/Lin_2023_zfish_snATAC-seq/SupplT2.csv")-> zfish_snATAC_seq_cell_type_mrkrs #markers for cell types
## New names:
## Rows: 8684 Columns: 12
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): group_name, seqnames, name dbl (9): ...1, group, start, end, strand, idx,
## Log2FC, FDR, MeanDiff
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Then, the count matrix was explored:
class(zfish_snATAC_seq_pk_mtrx) #what class does the object have
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(zfish_snATAC_seq_pk_mtrx) #gives dimensions
## [1] 370058 50637
zfish_snATAC_seq_pk_mtrx[1:10, 1:2]
## 10 x 2 sparse Matrix of class "dgCMatrix"
## 24hpf_1#24hpf_1_BC00224_N06 24hpf_1#24hpf_1_BC00014_N03
## chr1:5232-5732 . .
## chr1:5787-6287 . .
## chr1:10088-10588 . .
## chr1:10991-11491 . .
## chr1:11895-12395 . 2
## chr1:12474-12974 1 .
## chr1:14016-14516 . .
## chr1:14704-15204 1 .
## chr1:16672-17172 3 .
## chr1:18404-18904 3 .
The object zfish_snATAC_seq_pk_mtrx is a compressed
matrix consisting of 370058 rows, whose names represent the locations on
the chromosome (“peak”), and 50637 columns, whose names represent all
single cell samples, which were collected. The sample names include
information about how many hours post fertilisation (hpf) the embryos
were collected.
The data frame zfish_snATAC_seq_pk_mtrx contains the
meta data information for every sample of the count matrix, such as cell
type, reads in peaks and many more.
class(zfish_mta_data)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(zfish_mta_data)
## [1] 50637 24
zfish_mta_data
The data frame zfish_snATAC_seq_cell_type_mrkrs contains
further information on all peaks, which could have been used to assign
cell types.
class(zfish_snATAC_seq_cell_type_mrkrs)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(zfish_snATAC_seq_cell_type_mrkrs)
## [1] 8684 12
zfish_snATAC_seq_cell_type_mrkrs
Additionally the two tables below were loaded and explored.
Supplementary table 1 consists mostly of descriptions of the isolation of each sample.
read_excel("Data/Lin_2023_zfish_snATAC-seq/SupplT11 summary table of data deposited.xlsx") #further experiment data Supplementary Table 1
## New names:
## • `` -> `...10`
Supplementary table 3 contains further information on large fraction of the peaks.
read_csv("Data/Lin_2023_zfish_snATAC-seq/SupplT13-all developmental stages intergrated cell type marker of scRNA-seq.csv") #further experiment data Supplementary Table 3
## New names:
## Rows: 28907 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, gene dbl (6): p_val, avg_log2FC, pct.1, pct.2, p_val_adj, cluster
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Next, I reduced the count matrix
zfish_snATAC_seq_pk_mtrx and meta data
zfish_mta_data to perform QC steps locally before applying
them to the data sets containing all samples and all peaks. To achieve
this, I selected cell types Lauren Saunders deemed interesting and
sampled 1000 peaks randomly.
## subset cells for 5 celltypes
unique(zfish_mta_data$celltype)
## [1] "blastomere" "YSL/presumptive endoderm"
## [3] "EVL" "epiblast"
## [5] "hypoblast" "YSL"
## [7] "lateral plate mesoderm" "anterior/posterior axis"
## [9] "neural keel" "neural crest"
## [11] "periderm/epidermis" "segmental plate"
## [13] "UND" "mesenchyme cell"
## [15] "central nervous system" "integument"
## [17] "musculature system" "primary neuron"
## [19] "immature eye" "forebrain"
## [21] "erythroid lineage cell" "digestive system"
## [23] "neural stem cell"
## counting cells
zfish_mta_data %>%
group_by(celltype) %>%
tally() %>%
arrange(-n) %>%
filter(celltype %in% c("neural crest", "periderm/epidermis", "lateral plate mesoderm", "primary neuron","YSL/presumptive endoderm")) %>% #one from each germ layer plus neural crest and neurons
mutate(sum(n)) #testing if the cell types add up to ~ 5000
## filter selected celltypes
sub_mta_data.df = zfish_mta_data %>%
filter(celltype %in% c("neural crest", "periderm/epidermis", "lateral plate mesoderm", "primary neuron","YSL/presumptive endoderm"))
##selecting 1000 peaks randomly
sub_pk_mtrx = zfish_snATAC_seq_pk_mtrx[sample(rownames(zfish_snATAC_seq_pk_mtrx), 1000), c(unique(sub_mta_data.df$...1))]
##testing if reduction was successful
dim(sub_pk_mtrx)
## [1] 1000 5514
sub_pk_mtrx[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names '5hpf_1#5.25hpf_1_BC0021_N02', '5hpf_1#5.25hpf_1_BC0012_N01', '5hpf_1#5.25hpf_1_BC0018_N01' ... ]]
##
## chr12:575791-576291 . . . . . . . . . .
## chr20:8554541-8555041 . . . . . . . . . .
## chr21:19062227-19062727 . . . . . . . . . .
## chr5:40786396-40786896 . . . . . . . . . .
## chr17:53328810-53329310 . . . . . . . . . .
## chr14:21426958-21427458 . . . . . . . . 2 .
## chr18:10952460-10952960 . . . . . . . . . .
## chr23:21458886-21459386 . . . . . . . . . .
## chr17:44203552-44204052 . . . . . . . . . .
## chr2:56201692-56202192 . . . . . . . . . .
There are around 5000 cells after subsetting cells of 5 cell types from the data set.
Lastly, as preparation for ATAC-Seq-QC on the reduced data set, I created a Seurat Object containing the reduced count matrix and meta data and saved it as an RDS file.
# creating Seurat Object
sub_zfish_atac=CreateSeuratObject(counts = sub_pk_mtrx, assay = "atac", meta.data = sub_mta_data.df)
## Warning: Setting row names on a tibble is deprecated.
#save as RDS file
saveRDS(sub_zfish_atac, file = "Data/Lin_2023_zfish_snATAC-seq/sub_zfish_atac.rds")
I also installed the signac package. It was created by the Stuart Lab. Signac is a package, which was specially developed for ATAC-seq data sets.
setRepositories(ind=1:3) # needed to automatically install Bioconductor dependencies
install.packages("Signac")
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Seurat_5.2.1 SeuratObject_5.0.2 sp_2.2-0 ggsci_3.2.0
## [5] readxl_1.4.5 broom_1.0.8 binom_1.1-1.1 lubridate_1.9.4
## [9] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_1.8.9
## [4] magrittr_2.0.3 spatstat.utils_3.1-2 farver_2.1.2
## [7] rmarkdown_2.29 vctrs_0.6.5 ROCR_1.0-11
## [10] spatstat.explore_3.3-4 htmltools_0.5.8.1 cellranger_1.1.0
## [13] sass_0.4.10 sctransform_0.4.1 parallelly_1.43.0
## [16] KernSmooth_2.23-26 bslib_0.9.0 htmlwidgets_1.6.4
## [19] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
## [22] zoo_1.8-12 cachem_1.1.0 igraph_2.1.4
## [25] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.7-1 R6_2.6.1 fastmap_1.2.0
## [31] fitdistrplus_1.2-2 future_1.40.0 shiny_1.10.0
## [34] digest_0.6.37 colorspace_2.1-1 patchwork_1.3.0
## [37] tensor_1.5 RSpectra_0.16-2 irlba_2.3.5.1
## [40] progressr_0.15.1 spatstat.sparse_3.1-0 timechange_0.3.0
## [43] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
## [46] compiler_4.4.2 bit64_4.6.0-1 withr_3.0.2
## [49] backports_1.5.0 fastDummies_1.7.5 MASS_7.3-64
## [52] tools_4.4.2 lmtest_0.9-40 httpuv_1.6.15
## [55] future.apply_1.11.3 goftest_1.2-3 glue_1.8.0
## [58] nlme_3.1-167 promises_1.3.2 grid_4.4.2
## [61] Rtsne_0.17 cluster_2.1.8 reshape2_1.4.4
## [64] generics_0.1.3 gtable_0.3.6 spatstat.data_3.1-6
## [67] tzdb_0.4.0 data.table_1.16.4 hms_1.1.3
## [70] spatstat.geom_3.3-5 RcppAnnoy_0.0.22 ggrepel_0.9.6
## [73] RANN_2.6.2 pillar_1.10.2 vroom_1.6.5
## [76] spam_2.11-1 RcppHNSW_0.6.0 later_1.4.1
## [79] splines_4.4.2 lattice_0.22-6 bit_4.6.0
## [82] survival_3.8-3 deldir_2.0-4 tidyselect_1.2.1
## [85] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.50
## [88] gridExtra_2.3 scattermore_1.2 xfun_0.52
## [91] matrixStats_1.5.0 stringi_1.8.4 lazyeval_0.2.2
## [94] yaml_2.3.10 evaluate_1.0.3 codetools_0.2-20
## [97] cli_3.6.1 uwot_0.2.2 xtable_1.8-4
## [100] reticulate_1.40.0 munsell_0.5.1 jquerylib_0.1.4
## [103] Rcpp_1.0.14 globals_0.17.0 spatstat.random_3.3-2
## [106] png_0.1-8 spatstat.univar_3.1-1 parallel_4.4.2
## [109] dotCall64_1.2 listenv_0.9.1 viridisLite_0.4.2
## [112] scales_1.3.0 ggridges_0.5.6 crayon_1.5.3
## [115] rlang_1.1.4 cowplot_1.1.3